Our client, a hypothetical pharmaceutical company, wants to better understand data relating individuals (clients and potential clients) to various health conditions and miscellaneous attributes. The goal is to extract meaningful information that could guide future research and support the company's rapidly expanding business and market share, while focusing on and improving the wellbeing of its clients.
Individual - A person who has been surveyed by the NHANES program for various attributes covering the following: demographics, examinations, dietary intake, questionnaire (medical history), and medication.
Health Conditions - Various diseases or ailments that people may exhibit, such as sleep disorders, diabetes, oral health problems, and high cholesterol.
The National Health and Nutrition Examination Survey (NHANES) - A program of studies designed to assess the health and nutritional status of adults and children in the United States.
The data gathered is spread across six distinct files (CSV format): Demographics, Examinations, Dietary, Laboratory, Questionnaire, and Medication.
Our client wants to develop new drugs whose primary intent is to improve the quality of life of the individuals surveyed. The company is interested in whether existing data on subjects and their associated health conditions could provide advice and insight to its researchers. They have obtained the NHANES dataset and requested our assistance in performing the intended analysis. This dataset contains individuals' data along with various information, including health conditions.
The company is interested in developing new drugs for the following health conditions: diabetes, hypertension (blood pressure), and cancer.
The company, aware of our Machine Learning skills, approached us for help on the following problems:
Within the healthcare dataset, the business has noted that there are thousands of attributes, along with many missing values throughout the data. The business has a lot of old trial data and would like to enrol more patients for its diabetes drugs, but it does not want to spend too much on finding new candidates. They are unsure which attributes are the most meaningful in relation to diabetes. They are also in the middle of cancer trials and are looking for possible future referrals for their diabetes trials.
Could a smaller subset of the data help tell who has diabetes? If so, data collection could be refined to capture only those elements.
And are there insights to be gained from the demographics data in relation to diseases?
The company also asked about wrapping the model as a robust, easy-to-use app that could be presented to management and corporate to assist with decision making, based on a few user inputs.
The marketing department is struggling with the high costs of television advertisements and is interested in ways to reduce costs while still hitting its target markets, both for advertising drugs and for attracting candidates for trials.
To address the first business problem, we will apply supervised and unsupervised machine learning. From the thousands of attributes across the dataset, we will flatten the dataset (via PCA) and apply supervised machine learning algorithms to predict who has diabetes (a); this should improve the pharmaceutical company's capability to find referrals using current or past trial data. The key will be to use as few attributes as possible in order to maximize portability. Secondly, we will use an unsupervised clustering approach on the demographics data to explore whether it shows any significant findings for the company (b).
The second business problem involves using "health condition" features and finding related features. We will apply two types of unsupervised machine learning approaches to address it. Firstly, we will use an association learning method to discover which attributes are associated with health conditions (a), borrowing an approach traditionally used for market basket analysis.
Secondly, we will use machine learning unsupervised clustering techniques to look for meaningful insights in the data (b).
Questions to consider during work:
If this is the case, we need to find clusters of subjects that segregate the data by health conditions and report these findings to the business.
For both problems, we will need to see which attributes are tied to the "health condition" features. To achieve this, we assume that the following columns/features of the Questionnaire dataset indicate that an individual has a "health condition":
DIQ010 - Doctor told you have diabetes https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DIQ_H.htm The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?
BPQ020 - Ever told you had high blood pressure https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPQ_H.htm {Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had hypertension, also called high blood pressure?
MCQ220 - Ever told you had cancer or malignancy https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/MCQ_H.htm#MCQ220 {Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had cancer or a malignancy (ma-lig-nan-see) of any kind?
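These three questionnaire fields can be turned into binary target flags. Below is a minimal sketch, assuming the standard NHANES response coding (1 = Yes, 2 = No, with 7/9 reserved for Refused/Don't know, treated as NA here); the flag names are our own:

```r
# Hypothetical sketch: derive binary targets from the three
# questionnaire columns (toy values; real data has many more rows).
q <- data.frame(
  DIQ010 = c(1, 2, 9, 2),   # diabetes
  BPQ020 = c(2, 1, 2, 7),   # hypertension
  MCQ220 = c(2, 2, 1, 2)    # cancer
)

# 1 -> 1 (has condition), 2 -> 0, anything else -> NA
to_flag <- function(x) ifelse(x == 1, 1L, ifelse(x == 2, 0L, NA_integer_))

q$HAS_DIABETES     <- to_flag(q$DIQ010)
q$HAS_HYPERTENSION <- to_flag(q$BPQ020)
q$HAS_CANCER       <- to_flag(q$MCQ220)
```

Responses coded as Refused/Don't know are deliberately kept as NA rather than forced into either class.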
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(mice)
library(scales)
library(randomForest)
library(psych)
library(factoextra)
library(RColorBrewer)
library(caret)
library(plotly)
library(AMR)
As indicated earlier, the dataset consists of six raw data files: Demographics, Examinations, Dietary, Laboratory, Questionnaire, and Medication. The largest dataset, in terms of attributes, contains 953 variables, while the smallest one contains 47 variables.
Because this is a large amount of data, with over a thousand attributes cumulatively, we decided to employ the following guidelines to reduce the complexity of the data:
Ideally, we would like to analyze and impute every attribute with missing values, but in this situation, it may not be practical due to the large volume of missing data.
It is always essential to check for missing values and consider how to address them in the model.
We decided to keep the Demographic and Diet datasets as they are mostly complete.
We found that the percentage of missing data in four of the six spreadsheets is very significant. Almost all attributes/columns have varying degrees of missing values.
As per our guidelines, we will select attributes/columns of interest based on our business/personal judgements. The full NHANES data dictionary/variable list is available at the following URL:
https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013
We first remove the variables with near-zero variance in the dataset. Later, we will remove the variables with more than 25% missing values from the Demographics dataset.
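The two filters can be sketched in base R on a toy data frame. The report itself uses caret::nearZeroVar; here the frequency-ratio idea is written out by hand, and the column names and the cutoff of 19 (caret's default freqCut) are illustrative:

```r
# Toy data: one near-constant column, one mostly-missing column,
# and one column that should survive both filters.
df <- data.frame(
  mostly_constant = c(rep("A", 98), "B", "B"),
  mostly_missing  = c(rnorm(60), rep(NA, 40)),
  useful          = rnorm(100)
)

# Near-zero variance: ratio of the most common value's frequency
# to the second most common value's frequency.
freq_ratio <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  if (length(tab) < 2) Inf else tab[1] / tab[2]
}
nzv <- vapply(df, freq_ratio, numeric(1)) > 19
df <- df[, !nzv, drop = FALSE]

# Drop columns with more than 25% missing values.
df <- df[, colMeans(is.na(df)) <= 0.25, drop = FALSE]
names(df)   # only "useful" survives
```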
We will now refer to our dictionary to build a reference dataframe that distinguishes the different types of variables quickly and effectively:
Categorization of variables
We now enter the categorization (Factor / Numeric / 'Computation not required') in the generated Excel file:
* Only the third column is edited.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* The column name for the category is "Cat".
Reading the index again
Now we prepare the dataset for imputation from all this information.
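The report performs imputation with the mice package; as a simplified, hypothetical stand-in, the sketch below fills numeric columns with the median and factor columns with the most frequent level:

```r
# Simplified imputation sketch (a stand-in for mice): median for
# numeric columns, modal level for factor columns.
impute_simple <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next
    if (is.numeric(x)) {
      df[[col]][is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      mode_lvl <- names(sort(table(x), decreasing = TRUE))[1]
      df[[col]][is.na(x)] <- mode_lvl
    }
  }
  df
}

toy <- data.frame(age = c(30, NA, 50), cat = factor(c("a", "a", NA)))
filled <- impute_simple(toy)
```

Unlike mice, this ignores relationships between variables; it only illustrates where the imputation step sits in the pipeline.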
We have selected the following 8 relevant columns among the 32 that have less than 25% missing values:
Now we will label the dataset for visualizations.
First, we remove all the near-zero-variance features from the dataset, with a cutoff of 45%:
Now, we will remove the features with more than 25% missing values, as decided before:
We have selected the following 69 relevant columns among the 88 that have less than 25% missing values:
We will now refer to our dictionary to build a reference dataframe that distinguishes the different types of variables quickly and effectively:
Categorization of variables
We now enter the categorization (Factor / Numeric / 'Computation not required') in the generated Excel file:
* Only the third column is edited.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* The column name for the category is "Cat".
Reading the index again
Now we prepare the dataset for imputation from all this information.
Labeling the dataset:
First, we remove all the near-zero-variance features from the dataset, with a cutoff of 45%:
Now, we will remove the features with more than 25% missing values, as decided before:
We have selected the following 12 relevant columns among the 105 that have less than 25% missing values:
We will now refer to our dictionary to build a reference dataframe that distinguishes the different types of variables quickly and effectively:
Categorization of variables
We now enter the categorization (Factor / Numeric / 'Computation not required') in the generated Excel file:
* Only the third column is edited.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* The column name for the category is "Cat".
Reading the index again
Now we prepare the dataset for imputation from all this information.
Labeling the dataset:
First, we remove all the near-zero-variance features from the dataset, with a cutoff of 45%:
Now, we will remove the features with more than 25% missing values, as decided before:
We have selected the following 9 relevant columns among the 46 that have less than 25% missing values:
We will now refer to our dictionary to build a reference dataframe that distinguishes the different types of variables quickly and effectively:
Categorization of variables
We now enter the categorization (Factor / Numeric / 'Computation not required') in the generated Excel file:
* Only the third column is edited.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* The column name for the category is "Cat".
Reading the index again
Now we prepare the dataset for imputation from all this information.
Labeling the dataset:
First, we remove all the near-zero-variance features from the dataset, with a cutoff of 45%:
Now, we will remove the features with more than 32% missing values:
All of the columns had more than 25% missing values, so we relaxed the threshold to 32%. Among the 8 columns with less than 32% missing values, we selected the following 5 relevant columns:
We will now refer to our dictionary to build a reference dataframe that distinguishes the different types of variables quickly and effectively:
Categorization of variables
We now enter the categorization (Factor / Numeric / 'Computation not required') in the generated Excel file:
* Only the third column is edited.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* The column name for the category is "Cat".
Reading the index again
Now we prepare the dataset for imputation from all this information.
Labeling the dataset:
First, we will remove the near-zero-variance variables.
Now, we will remove the features with more than 25% missing values, as decided before:
We have selected the following 38 relevant columns among the 79 that have less than 25% missing values:
We will now refer to our dictionary to build a reference dataframe that distinguishes the different types of variables quickly and effectively:
Categorization of variables
We now enter the categorization (Factor / Numeric / 'Computation not required') in the generated Excel file:
* Only the third column is edited.
* The codes are:
* 0 = Factor requiring no computation.
* 1 = Numeric requiring computation.
* 2 = Factor requiring computation.
* The column name for the category is "Cat".
Reading the index again
Now we prepare the dataset for imputation from all this information.
Now we label and save the data set:
We now perform visualizations against the cleaned datasets and their union.
Visuals against the cleaned dataset
This graph shows the number of proxy users we have in our database:
This graph shows the number of proxy users having Diabetes:
Our sample is fairly representative of the US population:
Visuals against the cleaned dataset
Visuals against the cleaned dataset
Visuals against the cleaned dataset
Visuals against the cleaned dataset
Visuals against the cleaned dataset
First, our target attributes need to be added to a dataset.
As part of the business problem, we are focusing on three targets (diabetes, hypertension, cancer):
Given that an individual has diabetes, predict whether the individual has cancer or hypertension, using as little data as possible to keep costs low.
Marking data for Diabetes
We will now keep the features associated with diabetes using PCA and correlation plots.
Correlation Plot:
PCA
We notice that the first 10 components have an eigenvalue > 1 and explain almost 80% of the variance; so if we reduce dimensionality from 35 to 10, we only lose about 20% of the variance.
The first two components explain only 30% of the variance. We need 18 principal components to explain more than 95% of the variance and 27 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the selected variables below.
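The eigenvalue and cumulative-variance checks described above can be reproduced with base R's prcomp; the sketch below uses simulated data with 35 columns (the real input is the cleaned NHANES block):

```r
# PCA component-count decision on simulated data.
set.seed(1)
X <- matrix(rnorm(200 * 35), ncol = 35)
p <- prcomp(X, center = TRUE, scale. = TRUE)

eig     <- p$sdev^2                  # eigenvalues of the correlation matrix
cum_var <- cumsum(eig) / sum(eig)    # cumulative proportion of variance

sum(eig > 1)                  # components with eigenvalue > 1 (Kaiser rule)
which(cum_var > 0.95)[1]      # components needed for 95% of the variance
```

On scaled data the eigenvalues sum to the number of variables, so the two criteria can be read directly off `eig` and `cum_var`.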
We will now keep the features associated with diabetes using PCA and correlation plots.
Correlation Plot:
PCA
We notice that the first 24 components have an eigenvalue > 1 and explain almost 90% of the variance; so if we reduce dimensionality from 87 to 24, we only lose about 10% of the variance.
The first two components explain only 35% of the variance. We need 27 principal components to explain more than 95% of the variance and 35 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 13 selected variables below.
We will now keep the features associated with diabetes using PCA and correlation plots.
Correlation Plot:
PCA
We notice that the first 14 components have an eigenvalue > 1 and explain almost 75% of the variance; so if we reduce dimensionality from 97 to 14, we only lose about 25% of the variance.
The first two components explain only 40% of the variance. We need 35 principal components to explain more than 95% of the variance and 42 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 31 selected variables below.
We will now keep the features associated with diabetes using PCA and correlation plots.
Correlation Plot:
PCA
We notice that the first 24 components have an eigenvalue > 1 and explain almost 70% of the variance; so if we reduce dimensionality from 77 to 24, we lose about 30% of the variance.
The first two components explain only 20% of the variance. We need 22 principal components to explain more than 80% of the variance and 37 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 19 selected variables below.
We will run an MFA to find relations among the features for data reduction.
We notice that the component RXDUSE explains almost 75% of the variance across 5 other components; so if we reduce dimensionality from 9 to 1, we only lose about 25% of the variance. We analysed each feature in each dimension and found that the only feature with greater variance is RXDUSE.
We will now keep the features associated with diabetes using PCA and correlation plots.
Correlation Plot:
PCA
We notice that the first 22 components have an eigenvalue > 1 and explain almost 70% of the variance; so if we reduce dimensionality from 75 to 10, we will lose 30% of the variance.
The first two components explain only 35% of the variance. We need 35 principal components to explain more than 95% of the variance and 38 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 15 selected variables below.
We will now keep the features associated with diabetes using PCA and correlation plots.
Correlation Plot:
PCA
We select the features correlated with the target (HAS_DIABETES) with abs(coefficient) > 0.1.
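A minimal sketch of this correlation filter on toy data (the real target is HAS_DIABETES; the feature names here are made up):

```r
# Keep features whose absolute correlation with the binary target
# exceeds 0.1.
set.seed(2)
n <- 500
target <- rbinom(n, 1, 0.3)
feats <- data.frame(
  related   = target + rnorm(n),   # built to correlate with the target
  unrelated = rnorm(n)             # pure noise
)

cors <- sapply(feats, function(x) cor(x, target))
keep <- names(cors)[abs(cors) > 0.1]
keep
```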
We notice is that the first 9 components has an Eigenvalue >1 and explains almost 80% of variance. So if wereduce dimensionality from 35 to 8 we will lose 20% of variance!
The two first components explains only 30% of the variance. We need 18 principal components to explain more than 95% of the variance and 27 to explain more than 0.99
We are going to create a training and test set of these data:
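A minimal sketch of the split on toy data. The report uses caret for this; a plain 80/20 base-R split (without stratification) looks like:

```r
# 80/20 train/test split by random row sampling.
set.seed(42)
dat <- data.frame(x = rnorm(100), y = factor(rbinom(100, 1, 0.3)))

train_idx <- sample(seq_len(nrow(dat)), size = 0.8 * nrow(dat))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```

With an imbalanced outcome, caret::createDataPartition would additionally preserve the class proportions in both halves.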
Let’s try Logistic Regression:
Logistic Regression with pca:
Let’s try random forest:
Random forest with pca
Let’s try KNN model
Let’s compare the models and check their correlation:
Most of the models have low variability with respect to the processed sample. The random forests (RF, PCA_RF, and CORR_RF) achieve a great AUC with very low variability.
Let's review how these models perform on the testing dataset. Prediction classes are obtained by default with a threshold of 0.5, which may not be the best choice for an imbalanced dataset like this one.
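The effect of moving the threshold away from the default 0.5 can be sketched on simulated probabilities (in the report these come from the fitted models):

```r
# Simulated scores for an imbalanced outcome (~10% positives).
set.seed(3)
truth <- rbinom(1000, 1, 0.1)
prob  <- plogis(qlogis(0.1) + 2 * truth + rnorm(1000))

# Sensitivity (true-positive rate) at a given classification threshold.
sens_at <- function(th) {
  pred <- as.integer(prob >= th)
  sum(pred == 1 & truth == 1) / sum(truth == 1)
}
sens_at(0.5)   # default threshold
sens_at(0.2)   # lower threshold catches more positives
```

Lowering the threshold trades precision for sensitivity, which is often the right trade when missing a positive (a diabetic patient) is costlier than a false alarm.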
The best sensitivity (detection of diabetes) is achieved by the random forest with the top five correlated features, and the PCA-based model has a great F1 score.
We found that the random forest with the top five features correlated to the target (HAS_DIABETES) gives good results over the test set. This model has a sensitivity of 0.997 with an F1 score of 0.998.
The ShinyApp was built to help predict a patient's condition based on the selected attributes.
From the above graphs, we notice that the first 4 components have an eigenvalue > 1 and explain almost 60% of the variance. We cannot effectively reduce dimensionality from 8 to 4 because we would lose about 40% of the variance.
Using just the first two components, no disease shows separation between sick and healthy people. This clearly indicates that we cannot classify based only on the demographics data.
From the above graph, we conclude that 6 is the appropriate number of clusters, since it appears at the bend in the elbow plot.
Now, let us take k = 6 as our optimal number of clusters.
From the above visualization, we observe that across the clusters both males and females have almost the same age range.
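The elbow check and the final k = 6 fit above can be sketched with base R's kmeans (the report uses factoextra for the elbow plot; the data here is simulated):

```r
# Elbow check: total within-cluster sum of squares for k = 1..10.
set.seed(4)
X <- scale(matrix(rnorm(300 * 4), ncol = 4))

wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
# plot(1:10, wss, type = "b")  # look for the bend ("elbow")

# Final fit at the chosen k.
fit <- kmeans(X, centers = 6, nstart = 25)
table(fit$cluster)
```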
Find associations between diseases and the diet/demographics data, as per the business problem.
Association mining is often used for market basket analysis. However, for the NHANES healthcare dataset, we will explore associations within the data and attempt to provide value in addressing the marketing business problems of the pharmaceutical company: advertising its drugs and attracting individuals to clinical trials.
Our first task is to prepare the data for the association mining algorithms.
The association rules will reference the values of the attributes. If a value simply says "Yes", its meaning might be ambiguous; if the value is "US Citizen", the meaning is precise. Below are a couple of examples where we have re-coded the values of attributes:
The above recoding was performed for the 18 attributes we selected for the association dataset. We focused on attributes with categorical values; for association mining, numerical values may not add value unless they are binned into categories. For now, we have focused on 18 attributes available in the cleaned dataset. Since the dataset is rich with many attributes, more could be added to the association mining algorithms in the future if the business finds value in this type of analysis.
In order to apply the association algorithms, the dataset has to be transformed into a transactional dataset. First, we merge all the categorical values required for mining into a single description attribute:
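A minimal sketch of that merge step, with made-up column names:

```r
# Paste the recoded categorical columns into one description string
# per person; each string then splits into the "items" of a basket.
df <- data.frame(
  citizenship = c("US Citizen", "Non-Citizen"),
  diabetes    = c("Has Diabetes", "No Diabetes"),
  milk        = c("Drinks Milk Daily", "Rarely Drinks Milk")
)

df$description <- apply(df, 1, paste, collapse = "; ")
items <- strsplit(df$description, "; ", fixed = TRUE)   # one basket per row
```

The list of item vectors is what gets coerced into the transactional format that the association algorithms consume.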
Now that the data is prepared, we can apply the association algorithms.
First, we create association rules against the dataset.
We plot the 20 most frequent values found within the data.
Per the above, as expected, being a US citizen, right-handed, and born in the US are among the most frequent values. The values for not having diseases are also at the top of the list.
Over 400,000 rules are produced for the entire dataset; let's take a glance at 5 of them below.
In the above output, we can see different association mining rules for the entire dataset. Each rule has an LHS and an RHS, which express the relation between itemsets (collections of values): the items on the LHS are associated with, and occur together with, the single item on the RHS. We will now create association rules for having and not having each particular disease (cancer, diabetes, hypertension). The RHS will be set to the particular health condition, and we will observe what types of associations are discovered on the LHS.
In order to produce a list of association rules, we had to experiment with the "conf" (confidence) parameter. For example, for positive cancer rules we had to lower the confidence to 0.4 to produce any rules. For each health condition, we created 2 sets of rules: the first allows a larger number of items on the LHS (maxlen=15), whereas the second forces the rules to contain only a few items (maxlen=3).
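What the confidence knob controls can be illustrated by computing support and confidence by hand on a toy set of baskets (arules::apriori performs the same computation at scale):

```r
# Four toy "baskets" of items.
baskets <- list(
  c("milk", "diabetes"),
  c("milk", "diabetes", "cancer"),
  c("milk"),
  c("insurance", "diabetes")
)

# Support of an itemset: fraction of baskets containing all its items.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {milk} -> {diabetes}: confidence = support(both) / support(lhs).
conf <- support(c("milk", "diabetes")) / support("milk")
conf   # 2 of the 3 milk baskets also contain diabetes, so 2/3
```

Raising the minimum confidence prunes rules like this one; maxlen bounds how many items the LHS and RHS may contain in total.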
For cancer association rules, we will examine both large and small items found in conjunction with an individual having cancer.
First, we inspected the rules where the individual has cancer and observed which large itemsets occur in conjunction with cancer. The confidence level was set to 0.4 for this set of rules, which might be considered low; however, a handful of rules were generated. Of note, those that have cancer are also associated with having hypertension and diabetes. An interesting observation is that drinking milk occurs in multiple rules.
Next, we inspected the rules where the individual has cancer and observed the small itemsets. The confidence level was set even lower to generate results for small itemsets in conjunction with cancer. Again, similar items, such as having diabetes and hypertension, appear in the small itemsets.
In order to build the association mining lists, we had to reduce confidence levels to under 0.5.
As with cancer association rules, we will examine both large and small items found in conjunction with an individual having diabetes.
For large itemsets with a positive diabetes result, we were able to increase the confidence level to 0.7, and 32 rules were generated. Of note, a household income between $20,000-$24,999 and daily/weekly milk consumption appear in several rules. Interestingly, there are rules where an individual has health insurance coverage, while none of the rules contain the opposite condition of not having coverage.
For small itemsets, all the rules include having cancer in association with diabetes.
First, we inspected the association rules with large itemsets for those individuals with hypertension. Rules with confidence levels of 1 are also found within this itemset. Unlike the previous two health conditions, race appears more prominently within the association rules.
Second, we inspected the association rules with small itemsets for those individuals with hypertension. A marital status of "widowed" appears more frequently than other marital values within the rules.
For the rules we examined in the previous section, we've taken the top 20 rules and created interactive scatter plots and graphs to visualize the data.
The following is a scatter graph visualizing the top 20 association rules for cancer with large itemsets. Please note, the points on the graph are interactive; hover the cursor over a point to see the association rule.
CANCER (large itemsets)
The following is a scatter graph for visualizing the top 20 association rules for cancer with small itemsets.
CANCER (small itemsets)
The following is a scatter plot for visualizing the top 20 association rules for diabetes with large itemsets.
DIABETES (large itemsets)
The following is a scatter plot for visualizing the top 20 association rules for diabetes with small itemsets.
DIABETES (small itemsets)
The following is a scatter plot for visualizing the top 20 association rules for hypertension with large itemsets.
HYPERTENSION (large itemsets)
The following is a scatter plot for visualizing the top 20 association rules for hypertension with small itemsets.
HYPERTENSION (small itemsets)
The following graphs are interactive. Hover the cursor over the rule, to see the related values. Hover the cursor over a value, to see the related rules.
The following is a graph for visualizing the top 20 association rules for cancer with large itemsets.
CANCER (large itemset)
The following is a graph for visualizing the top 20 association rules for cancer with small itemsets.
CANCER (small itemset)
The following is a graph for visualizing the top 20 association rules for diabetes with large itemsets.
DIABETES (large itemset)
The following is a graph for visualizing the top 20 association rules for diabetes with small itemsets.
DIABETES (small itemset)
The following is a graph for visualizing the top 20 association rules for hypertension with large itemsets.
HYPERTENSION (large itemset)
The following is a graph for visualizing the top 20 association rules for hypertension with small itemsets.
HYPERTENSION (small itemset)
With the association rules for cancer, we’ve plotted the top 20 values that were represented in the itemsets.
With the association rules for diabetes, we’ve plotted the top 20 values that were represented in the itemsets.
With the association rules for hypertension, we’ve plotted the top 20 values that were represented in the itemsets.
In the preceding section, we looked at associations between having diseases/health conditions and other values. To complement our findings, we decided to also create association rules for not having the diseases. This might yield beneficial findings and support any findings from the previous association rules involving positive values for diseases.
The following rules were used:
The rules for not having Cancer
Below, we've listed the top 20 association rules for not having cancer. Of interest, "no health insurance" and "visiting multiple places for healthcare" appear in many rules that lead to not having cancer.
The rules for not having Diabetes
Below, we've listed the top 20 association rules for not having diabetes. There are multiple rules with the value where the individual receives healthcare from various places (as opposed to one location).
The rules for not having Hypertension
Below, we've listed the top 20 association rules for not having hypertension. Of interest, the values for smokers and non-smokers both appear in the results. Many of the video-game-related values also appear in the rules below.
Having gathered data for both having and not having the diseases, we've attempted to extract insights that could provide business value to the marketing department, as per the initial business problem. Please note that association rules do not establish causation; they only highlight values that appear together. Our conclusions are subjective, based on our interpretation of the data.
These association rules show what related items are found in conjunction with having different diseases and health conditions. Below, we will discuss some of our findings:
The value associated with drinking milk multiple times a day or week appears several times for the diabetes and cancer conditions. Additionally, the values for drinking milk do not appear in the association rules for not having cancer/diabetes. It might be valid to position marketing for cancer/diabetes drugs in conjunction with milk placement. For example, YouTube videos often place advertisements in pairs; we could have a cancer drug advertisement appear after a milk advertisement in a YouTube video. Please note that we are not suggesting that milk consumption causes cancer; we are only pointing out an association that is present in the data.
For hypertension, many of the associated income values are under $24,999. For marketing, placing billboards in areas where salaries are under $24,999 could help market drugs towards those with hypertension.
Although our business is focused on marketing drugs to patients with cancer, diabetes, and hypertension, we can also look for out-of-the-box solutions. If the business were looking to develop drugs (or supplements) for the prevention of hypertension, we could use the data to identify associations among audiences that do not already have the disease.
In summary, these are a few of the suggestions that could be derived from the data. We think these suggestions could have value and provide the “so what” for our conclusions.
The association models used in the preceding sections contained 18 variables. We will provide our results to the business; however, the association rules could be improved by adding more categorical variables or binned numerical variables. Recoding and binning values from the raw data increases the overhead of adding more attributes, but if the business is intrigued by the findings, more data can be incorporated into the association ruleset.
Within the data for not having diseases, the condition of not having medical insurance appears multiple times. Is this an indication that people without medical insurance are truly not associated with the diseases? Recall that this field comes from a questionnaire that presupposes the individual has seen a doctor. If an individual has not seen a doctor for a diagnosis due to lacking health insurance coverage, they may not have been able to accurately ascertain whether they have a particular disease.
The marketing department is struggling with the high costs of television advertisements and is interested in ways to reduce costs while still hitting its target markets, both for advertising drugs and for attracting candidates for trials.
We only used the demographics database to avoid potential HIPAA breaches. The features below were selected to assist the marketing department with their market segmentation efforts:
There were two columns for education: one that breaks down the elementary studies of the participants, and another that more broadly indicates higher levels of education. We are not interested in that level of granularity, so we merged both columns and reduced the number of factors to mean "highest level of education achieved". This reduced the missing values from over 40% in each column to under 17%.
For consistency, the Age feature was converted to categorical.
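Both recodes can be sketched on toy columns (in the real NHANES data the education columns are DMDEDUC2/DMDEDUC3 and age is RIDAGEYR; the break points below are illustrative):

```r
# Toy demographics with two education columns and a numeric age.
demo <- data.frame(
  educ_broad = c(NA, "College", "High school"),
  educ_elem  = c("6th grade", NA, NA),
  age        = c(8, 34, 67)
)

# Merge: prefer the broad column, fall back to the elementary one.
demo$education <- ifelse(!is.na(demo$educ_broad),
                         demo$educ_broad, demo$educ_elem)

# Convert age to a categorical feature.
demo$age_group <- cut(demo$age, breaks = c(0, 17, 39, 64, Inf),
                      labels = c("0-17", "18-39", "40-64", "65+"))
```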
Check that the features have the appropriate class.
Impute missing values
Check that missing values are below 25%.
Hierarchical clustering was chosen because the features are categorical.
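For purely categorical features, the Gower distance reduces to the proportion of mismatched attributes (simple matching). The sketch below computes that distance by hand in base R before passing it to hclust; a real pipeline would typically use cluster::daisy instead:

```r
# Toy categorical demographics.
set.seed(5)
df <- data.frame(
  sex  = sample(c("M", "F"), 20, replace = TRUE),
  educ = sample(c("HS", "College"), 20, replace = TRUE)
)

# Pairwise simple-matching distance: share of mismatched attributes.
n <- nrow(df)
m <- as.matrix(df)
d <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n) d[i, j] <- mean(m[i, ] != m[j, ])

# Hierarchical clustering on the distance matrix.
hc <- hclust(as.dist(d), method = "ward.D2")
clusters <- cutree(hc, k = 4)
table(clusters)
```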
The first plot is a tally of how many observations there are in each cluster. Subsequent plots show the distribution of the features among each cluster. All the plots are shown after the code.
Although the data appears very homogeneous, with many clusters having similar proportions, two clusters, 7 and 8, encompass more observations. The data from these two clusters would be recommended to the marketing department for further analysis.
The model used in the ShinyApp to predict a patient's chances of contracting cancer is random forest, given its best performance.
The ShinyApp can be accessed from: https://ml1000-group6.shinyapps.io/NAHNES2/